celeris v1.5.3 — core-engine performance milestone by FumingPower3925 · Pull Request #384 · goceleris/celeris

FumingPower3925 · 2026-06-21T13:59:02Z

Release PR for v1.5.3, the core-engine performance milestone (io_uring / epoll / adaptive).

Release-prep (this PR's tail)

Version 1.5.0 → 1.5.3 (was stale; even v1.5.2 shipped "1.5.0").
SECURITY.md updated for the 1.5.x supported line + a v1.5.x Security Improvements section (io_uring gen-tagged-CQE UAF hardening, H1 request-smuggling hardening, cpuMon + epoll-detach data-race fixes, middleware/secure COEP/X-Download-Options default-off behavior change, Go 1.26.4 toolchain bump).
Four middleware submodule require pins bumped v1.4.4 → v1.5.3.
probe: moved the CAP_SYS_NICE side-effect test under the linux build tag so the package compiles on darwin (local go test ./... was broken; CI unaffected).

Milestone highlights (the 38 commits)

Adaptive kernel/feature-gated start engine + lazy standby, effective AsyncHandlers for drivers + immediate-promote, dynamic-worker-scaler removal, io_uring/epoll memory-safety + race hardening, H1 parser hardening.

Merges fast-forward onto main (0 conflicts). After merge, the release is cut via gh release create v1.5.3 which fires release.yml (validate-tag → ci → tag-submodules → notify-proxy).

…nd was dead) Three wrong io_uring ABI constants silently disabled zero-copy send on every kernel:

- opSENDZC was 53 (IORING_OP_FUTEX_WAITV); IORING_OP_SEND_ZC is 47. SEND_ZC SQEs carried the wrong opcode → kernel returned -EINVAL on the probe → SEND_ZC reported unsupported and was disabled everywhere.
- cqeFNotif was 1<<2 (0x04 = IORING_CQE_F_SOCK_NONEMPTY); IORING_CQE_F_NOTIF is 1<<3 (0x08). Both the probe and the runtime notification handler (cqeIsNotif → handleSend) misread the zero-copy notification CQE.
- probeSendZC hardcoded the 0x04 NOTIF check; now uses cqeFNotif.

Also corrects the unused opSHUTDOWN constant (was 52=FUTEX_WAKE; IORING_OP_SHUTDOWN is 34) to prevent the same class of bug if it is ever used.

Verified on the cluster (kernel 7.0.0-22): the SEND_ZC probe now reports "true zero-copy" and the engine selects send_zc=true (previously send_zc=false / 'kernel rejected SEND_ZC opcode').

Note: perf-neutral on the current HTTP benchmark suite (interleaved A/B: get-json +0.4%, get-json-64k +0.0%; small responses don't use ZC's benefit, large are bandwidth-bound), but it is a genuine correctness fix that enables io_uring's zero-copy path and may help CPU-copy-bound workloads.

Refs #356
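The corrected values can be written down as a small, self-contained Go sketch. The constant names and the cqeIsNotif helper name come from the commit message; the types and the body of the helper are assumptions for illustration, not the engine's actual code.

```go
package main

import "fmt"

// Values as quoted in the commit message above.
const (
	opSHUTDOWN uint8  = 34     // IORING_OP_SHUTDOWN (was wrongly 52)
	opSENDZC   uint8  = 47     // IORING_OP_SEND_ZC (was wrongly 53)
	cqeFNotif  uint32 = 1 << 3 // IORING_CQE_F_NOTIF, 0x08 (was wrongly 1<<2)
)

// cqeIsNotif reports whether a CQE's flags mark it as the zero-copy
// notification completion rather than the send result itself.
// The body is an assumed, minimal form of the check.
func cqeIsNotif(flags uint32) bool {
	return flags&cqeFNotif != 0
}

func main() {
	const sockNonEmpty = 1 << 2 // IORING_CQE_F_SOCK_NONEMPTY, the old mistaken bit

	fmt.Println(cqeIsNotif(sockNonEmpty)) // false
	fmt.Println(cqeIsNotif(cqeFNotif))    // true
	fmt.Println(opSENDZC, opSHUTDOWN)     // 47 34
}
```

Routing every NOTIF check through the one named constant is what keeps the probe and the runtime handler from drifting apart again.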

Under Config.AsyncHandlers=true, a route that inherits the server/group default (not an explicit .Async()) is now ADAPTIVE: it runs INLINE on the event-loop worker (the ring-batched send path, which gets io_uring's ~33x syscall reduction) and is promoted to async dispatch only when an inline run is observed to block (>50us). Trivial handlers thus keep the cheap inline path; genuinely-blocking handlers (DB/cache round-trips) still get goroutine isolation after one inline run.

Why: the shipped iouring-h1-async config dispatched EVERY request to a goroutine that did a direct write(2) per response, bypassing the ring (~42% CPU in unix.Write), which made it slower than epoll. Measured: on CPU-bound chain cells the ring path beats epoll by +2.4..6.7%; this lets the async config reach that path for non-blocking handlers.

- router: adaptiveRoutes set (built at registration) + promoted sync.Map; routeAsync returns async || (adaptive && promoted). Explicit .Async()/.Sync() removes the route from the adaptive set (setAsync).
- handler: HandleStream times an inline adaptive run and promotes on block.
- Non-adaptive configs (no AsyncHandlers default) keep the empty-map fast path, zero added overhead.

Contract change (reflected in router_async_test.go): AsyncHandlers=true now means inline-first-adaptive, not all-async. Explicit .Async() is unchanged.

Refs #356
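The routing rule described above (async || (adaptive && promoted)) can be illustrated with a minimal sketch. The router type here is a hypothetical stand-in keyed by route string; only the names adaptiveRoutes, promoted, routeAsync and setAsync are taken from the commit message, and the field layout is assumed.

```go
package main

import (
	"fmt"
	"sync"
)

// router is a hypothetical stand-in for the real router.
type router struct {
	explicitAsync  map[string]bool     // routes with an explicit .Async()/.Sync()
	adaptiveRoutes map[string]struct{} // routes inheriting the AsyncHandlers default
	promoted       sync.Map            // route -> struct{}, set after a blocking inline run
}

func newRouter() *router {
	return &router{
		explicitAsync:  map[string]bool{},
		adaptiveRoutes: map[string]struct{}{},
	}
}

// routeAsync applies the rule: async || (adaptive && promoted).
func (r *router) routeAsync(route string) bool {
	if r.explicitAsync[route] {
		return true
	}
	if _, adaptive := r.adaptiveRoutes[route]; !adaptive {
		return false
	}
	_, promoted := r.promoted.Load(route)
	return promoted
}

// setAsync records an explicit choice and drops the route from the
// adaptive set, so it is never timed or promoted.
func (r *router) setAsync(route string, async bool) {
	r.explicitAsync[route] = async
	delete(r.adaptiveRoutes, route)
}

func main() {
	r := newRouter()
	r.adaptiveRoutes["/json"] = struct{}{}
	fmt.Println(r.routeAsync("/json")) // false: inline first

	r.promoted.Store("/json", struct{}{}) // an inline run was observed to block
	fmt.Println(r.routeAsync("/json"))    // true

	r.setAsync("/db", true)
	fmt.Println(r.routeAsync("/db")) // true: explicit, no inline window
}
```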

…ode-fix fix(iouring): correct SEND_ZC opcode + CQE_F_NOTIF flag — zero-copy send was dead on every kernel

…nline feat(core): adaptive inline-first dispatch under AsyncHandlers (#356) — iouring-async +7.5..48% on CPU-bound

…routes Blocking driver routes (probatorium /cache, /db, /mc) register with an explicit .Async() on an async-default server. setAsync drops them from the adaptive set, so handler.go's inline-first gate skips them and they resolve hard-async — no inline window, no worker stall. Pin that contract so the driver no-regression guarantee can't silently break.

parseHeaders already detects an incomplete header block (a final line with no CRLF yields lineEnd==-1 -> (false,nil)), and every ParseRequest caller Reset()s parser+req first, so a partial parse is always retried cleanly. The upfront findHeaderEnd CRLF-walked the same bytes parseHeaders walks again -- pure double work on the common single-read request. Slow-drip re-parse is bounded by MaxHeaderSize(64K)/MaxHeaderCount(200)/ReadHeaderTimeout, matching net/http and fasthttp which also re-parse on partial reads. Correctness: full h1 race suite + 46M fuzz execs (ParseRequest+ChunkedBody) pass; findHeaderEnd retained (asm + its own tests).

…r-scan perf(h1): drop redundant upfront findHeaderEnd block scan (#359)

When a middleware stack pushes respHeaders past the inline 16-slot respHdrBuf (chain-fullstack: 17 headers), append() moved it onto a heap array that reset then dropped — forcing a fresh ~576B alloc every header-heavy request. Retain it as respHdrScratch and reuse it: one alloc per pooled Context, not per request. Also folds SetResponseHeaders' >16-header path onto the scratch and removes the old clear(respHdrBuf[:n>16]) clamp foot-gun (the cap check is the correct discriminator). NOTE: rps-neutral on the cluster (one small alloc/req was not the chain-fullstack bottleneck); this is a GC/RSS-hygiene + foot-gun-removal change. Verified 0 allocs/op (TestContextRespHeaderOverflowReuseZeroAlloc) + full root race suite + overflow regression guard.

#361) The #356 classifier timed every inline run of an adaptive route (two time.Now() vDSO calls + recordInlineRun) forever. A profile of iouring-h1-async chain-fullstack showed time.runtimeNow at 3.22% CPU on routes that never block. Add a settled-fast terminal state: after adaptiveSettleStreak (256) CONSECUTIVE fast inline runs a route is proven non-blocking and leaves the timed path (adaptiveLearning short-circuits on a single settled sync.Map.Load — the same lookup the prior isPromoted check cost, minus the two time.Now()). A slow run resets the fast streak; explicit .Async()/.Sync() clears settled (setAsync). A/B (iouring-h1-async, interleaved): chain-api +1.4% (both rounds), get-json +0.4%. Full root race suite + new TestRouteAsync_AdaptiveSettles + existing #356 hysteresis/promotion tests pass.
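A minimal sketch of the settled-fast terminal state, assuming a single-goroutine per-route record (the real implementation is described as a sync.Map lookup). The constant adaptiveSettleStreak and the names adaptiveLearning and recordInlineRun come from the commit message; the routeState type and the signatures are hypothetical.

```go
package main

import "fmt"

const adaptiveSettleStreak = 256 // consecutive fast inline runs needed to settle

// routeState is a hypothetical per-route record.
type routeState struct {
	fastStreak int
	settled    bool
}

// adaptiveLearning reports whether inline runs of this route still need timing.
func (s *routeState) adaptiveLearning() bool { return !s.settled }

// recordInlineRun feeds one timed inline run into the state machine.
// A slow run resets the streak; enough consecutive fast runs settle the route.
func (s *routeState) recordInlineRun(fast bool) {
	if s.settled {
		return
	}
	if !fast {
		s.fastStreak = 0
		return
	}
	s.fastStreak++
	if s.fastStreak >= adaptiveSettleStreak {
		s.settled = true
	}
}

func main() {
	var s routeState
	for i := 0; i < adaptiveSettleStreak-1; i++ {
		s.recordInlineRun(true)
	}
	s.recordInlineRun(false)          // one slow run resets the streak
	fmt.Println(s.adaptiveLearning()) // true: still timed

	for i := 0; i < adaptiveSettleStreak; i++ {
		s.recordInlineRun(true)
	}
	fmt.Println(s.adaptiveLearning()) // false: settled, no more time.Now() calls
}
```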

perf(core): settle non-blocking adaptive routes to stop per-req timing (#361)

The 50µs bar was below the CPU-jitter range: a transient GC/scheduling burst could push a CPU-bound middleware chain (chain-fullstack, ~20µs base) over 50µs for 8 consecutive runs and wrongly promote it to the slower async path, intermittently collapsing iouring-async chain-fullstack by ~32% for a whole run (#364). 300µs sits far above the CPU-bound range (even a heavy chain under jitter) while still catching genuinely-blocking handlers (sub-ms+) — which are marked .Async() in practice anyway. Hardening, not a proven fix: the collapse is rare (<10%, did not reproduce in 10 fresh-SUT runs) so elimination can't be directly measured. rps-neutral within the ~±2% A/B noise floor (chain-fullstack/api/get-json). Inherited adaptive routes are CPU-bound (never legitimately promote); the only behavior change is a moderate-latency unmarked-blocking route promotes later — such routes should use explicit .Async().

…reshold fix(core): raise adaptive promote threshold 50µs->300µs (#364 hardening)

…-reuse perf(core): reuse heap respHeaders backing across requests (#360)

ResetH1Stream re-zeroed Headers/LazyRawHeaders/Method/Path/Scheme/Authority/ IsHEAD/EndStream/ResponseWriter + an atomic state.Store(StateIdle) that populateCachedStream + handleH1Request unconditionally overwrite on the next request (h1.go:852-941, single caller) — pure dead work + a redundant atomic on every keep-alive request. Keep only the fields the caller does NOT redo: rawBody=nil (load-bearing — a bodyless GET must not inherit a prior body), lazyHeadersBuilt, pseudoMaterialized, headersSent. Verified: caller sets all 10 dropped fields; full h1/stream/internal-conn race suites + h1 conformance pass (a stale-state leak would surface across keep-alive requests); new TestResetH1StreamClearsUniqueFields guards the rawBody contract; 0 allocs/op.

perf(h1): drop dead-store writes in ResetH1Stream (#346)

ioUringBias is a heuristic that estimates io_uring's advantage from connection count + CPU pressure but NEVER reads the standby engine's measured throughput. Ungated, biasModeledStandbyScore could fabricate a standby score high enough to switch adaptive onto an engine that is measurably SLOWER on the live workload (e.g. epoll-favored get-simple). Gate it behind CELERIS_ADAPTIVE_IOURING_BIAS (default off): with the bias off, bias=0 → the modeled standby never exceeds the active score → adaptive switches only on MEASURED active degradation vs a previously-observed standby, never speculatively onto an unmeasured/slower engine. The speculative bias stays opt-in for re-validation. Tests: TestControllerOrganicSwitch now forces biasEnabled=true (the bias is opt-in); new TestControllerNoSpeculativeSwitchBiasOff asserts the default-off no-switch on the same sweet-spot workload; score_test uses the 2-arg form. Full adaptive suite passes on linux/amd64 (cluster).

…lt-off fix(adaptive): gate io_uring bias off by default (#341)

Set TCP_NODELAY once on the listen socket (Linux copies it onto every accepted socket at SYN time) and drop it from the per-accept sockopts.Options, removing one setsockopt syscall per accept on the hot path. rps-neutral on the cluster (churn-close within the ~±2% A/B noise; the bench is NIC-bound) — a syscall-count efficiency win, not a throughput change. Verified: full epoll suite passes; new linux tests assert the listen socket has TCP_NODELAY AND accepted conns inherit it (guards that removing the per-accept setsockopt does NOT silently re-enable Nagle on the hot path).

perf(epoll): inherit TCP_NODELAY from the listen socket (#337)

SEND_ZC adds a second (NOTIF) CQE per send and holds zcNotifPending across the buffer's DMA lifetime, stalling the next flush — a net loss on small payloads where the avoided memcpy is tiny. Gate it behind sendZCMinBytes=4096 via a single useSendZC(sendZC, linked, n) helper used at all four send sites (highTier + optionalTier PrepareSend, worker prepSendSQE): small/linked sends use plain SEND (1 CQE, immediate buffer reuse), large unlinked sends still use ZC. rps-neutral on the cluster (get-json/get-simple within ±2% noise — NIC-bound; get-json-64k confirms ZC still chosen for >=4096, no regression) → removes one CQE/req on small async sends (efficiency), not a throughput change. Link invariant preserved (linked sends never ZC). Full iouring suite + new gating tests (TestUseSendZC, TestPrepSendSQEGatesBySize, TestPrepSendSQELinkedNeverZC) pass on linux/amd64.
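The gate itself is small enough to sketch directly. The helper name, its three parameters and the 4096-byte floor are from the commit message; the body below is an assumed reading of "small/linked sends use plain SEND, large unlinked sends still use ZC".

```go
package main

import "fmt"

const sendZCMinBytes = 4096 // below this, the extra NOTIF CQE costs more than the memcpy saved

// useSendZC decides whether a send of n bytes should use SEND_ZC.
// sendZC: the kernel probe reported zero-copy support.
// linked: the SQE is part of a link chain (linked sends never use ZC).
func useSendZC(sendZC, linked bool, n int) bool {
	return sendZC && !linked && n >= sendZCMinBytes
}

func main() {
	fmt.Println(useSendZC(true, false, 512))   // false: small payload, plain SEND
	fmt.Println(useSendZC(true, false, 65536)) // true: large unlinked send
	fmt.Println(useSendZC(true, true, 65536))  // false: link invariant
	fmt.Println(useSendZC(false, false, 65536)) // false: ZC unsupported
}
```

Keeping the decision in one pure function is what lets all four send sites share the same rule and be covered by a single table-driven test.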

…e-gate perf(iouring): gate SEND_ZC by payload size (#332)

@1024c

#341 made the io_uring bias safe by DISABLING it, but that left adaptive parked on epoll at high concurrency (it starts on epoll), missing io_uring's measured +6.8% @1024c. Restore the win SAFELY by making the bias reversible:

- the ACTIVE score is always the pure measurement (no reinforce/penalty), so leaving an engine is decided measured-vs-measured;
- biasModeledStandbyScore boosts ONLY the io_uring standby (EXPLORE) and returns 0 for the epoll standby (never models it down), so a wrongly-explored io_uring always REVERTS on measurement;
- history records the unbiased measured score.

Net: adaptive explores io_uring when the workload model favors it, keeps it only if it measures faster, and reverts otherwise. The 15% switch threshold + oscillation lock provide hysteresis (no thrash). Safe ON by default; CELERIS_ADAPTIVE_IOURING_BIAS=0 forces the conservative measurement-only controller (supersedes #341's default-off).

Validated: full adaptive suite on linux/amd64 incl new explore/revert/kill-switch/stability-under-fluctuation tests; cluster end-to-end confirmed: adaptive explored epoll→io_uring under sustained 1024c (892k→io_uring) and reverted to epoll when load stopped (0 throughput). Switch latency ~30s (deliberate observe-before-act; sustained-load win, tuning follow-up).

…e-bias feat(adaptive): reversible io_uring bias, default on (#338)

Blob assembled its response header list (content-type + content-length + user headers) via make([][2]string, 0, total) on EVERY response whose total exceeds respHdrBuf's 16 slots. An allocation profile of chain-fullstack (18 headers) showed this was the DOMINANT per-request alloc — ~77% of all allocations, ~1.16 GB/s → GC pressure → the throughput cost. (get-json and other <=14-user-header responses already hit the alloc-free inline fast path and are unaffected.) Reuse a per-Context blobHdrScratch (alloc once per pooled Context, not per request), mirroring respHdrScratch (#360). respHeaders never aliases it (separate buffer; the append copies the [2]string values). A/B (interleaved, 2 rounds, vs baseline): chain-fullstack +4.4% (iouring-h1- async) / +5.0% (epoll-h1-sync); get-json neutral (-0.2%/+0.1%, control — it never enters this path). Full root race suite + new TestContextBlobManyHeadersZeroAlloc (0 allocs/op) pass.
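The scratch-reuse pattern can be sketched as follows. The field names respHdrBuf and blobHdrScratch and the 16-slot inline size come from the commit messages; the reqContext type, the blobHeaders method and its signature are hypothetical, chosen only to show the technique.

```go
package main

import "fmt"

// reqContext is a hypothetical stand-in for the pooled Context.
type reqContext struct {
	respHdrBuf     [16][2]string // inline storage for the common case
	blobHdrScratch [][2]string   // heap backing, retained across requests
}

// blobHeaders assembles content-type + content-length + user headers.
// Up to 16 total headers use the inline array; larger lists reuse the
// retained scratch slice instead of allocating on every response.
func (c *reqContext) blobHeaders(contentType, contentLength string, user [][2]string) [][2]string {
	total := len(user) + 2
	var hdrs [][2]string
	if total <= len(c.respHdrBuf) {
		hdrs = c.respHdrBuf[:0]
	} else {
		if cap(c.blobHdrScratch) < total {
			c.blobHdrScratch = make([][2]string, 0, total) // once per pooled Context
		}
		hdrs = c.blobHdrScratch[:0]
	}
	hdrs = append(hdrs,
		[2]string{"content-type", contentType},
		[2]string{"content-length", contentLength},
	)
	return append(hdrs, user...)
}

func main() {
	c := &reqContext{}
	user := make([][2]string, 16) // 18 headers total, past the inline 16
	first := c.blobHeaders("text/plain", "2", user)
	second := c.blobHeaders("text/plain", "2", user)
	fmt.Println(len(second), &first[0] == &second[0]) // 18 true
}
```

The returned slice aliases the scratch, so in this sketch it is only valid until the next call on the same context; the commit notes that the real respHeaders copies the values rather than aliasing.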

…ratch perf(core): reuse blobHdrScratch for >16-header responses

…s off (#338) BEHAVIOR CHANGE. secure.New() no longer emits Cross-Origin-Embedder-Policy (require-corp) or X-Download-Options (noopen) by default — both are now opt-in. COEP=require-corp by default is a footgun: it blocks cross-origin resources (images/scripts without CORP/CORS), silently breaking many sites — the config's own comment warned about it, yet it was on by default. Helmet leaves COEP off for this reason; we now match. X-Download-Options only ever affected legacy IE and is obsolete. Set either field explicitly to re-enable. Default header count 11 -> 9 (HSTS still runtime-gated to HTTPS). Beyond the footgun fix, the smaller response flips chain-security to a WIN vs fasthttp (-1.2% -> +0.7%) and improves chain-fullstack (-6.0% -> -4.9%); chain-api (no secure mw) unchanged. secure suite + middleware + conformance pass; new coep/x-download opt-in test cases added.

fix(secure): default COEP + X-Download-Options off (opt-in) (#338)

…cv bodies A keep-alive connection that handled even one fixed-length body split across recvs was permanently promoted to the async dispatch goroutine (worker.go HasPendingData gate), then served every subsequent request via a blocking unix.Write + cross-goroutine condvar handoff instead of the inline io_uring linked SEND. Under sustained small-POST load ~11% of requests split, so essentially every long-lived conn was poisoned onto the slow path within its first few requests. A fixed-length body in progress resumes via the inline re-parse path (ProcessH1 bodyNeeded>0), which runs on the worker that owns h1State and is already async-checked (provably non-async) — exactly like the sync engine. Only buffered partial headers / chunked bodies genuinely need the InlineMode=false dispatch path. Split the gate: HasPendingDispatchState promotes for buffered-headers/chunked only, never for a fixed body. Also tighten pickRecvTarget: gate the zero-copy direct-into-bodyBuf recv bail on (w.async && cs.asyncPromoted) rather than blanket w.async, so inline-owned conns get the zero-copy body recv the sync path already uses. The worker still owns h1State for non-promoted conns, so no new races (this strictly REDUCES cross-goroutine handoff). Async-marked routes are still promoted at the fresh-parse site before the body, and partial headers still re-run the async check on completion.

The optionalTier was the only path that set IORING_SETUP_SQPOLL, reachable when the provided-buffers probe fails on an otherwise-High kernel. That path is doubly broken: (1) celeris runs one ring per worker, so SQPOLL spawns one kernel poll thread per worker — N spinning cores that starve the workers (measured -83% throughput, 75% idle CPU on a 16-worker box); (2) the dormant SQPOLL submit path has a latent SQ-tail-publish race in Ring.GetSQE (the shared tail is advanced before the SQE payload is written, safe only because io_uring_enter is the sync point on the non-SQPOLL path). optionalTier now uses the task-run completion model like highTier (DEFER_TASKRUN|SINGLE_ISSUER, or COOP_TASKRUN on 6.0), keeping provided buffers / multishot / SEND_ZC but never SQPOLL. SQPollIdle returns 0 so the worker SQPOLL branch is unreachable. Documented the GetSQE SQPOLL-unsafety so any future SQPOLL work fixes the tail-publish first. Test updated to assert the new contract (task-run, never SQPOLL).

…omotion fix(iouring): close post-4k gap — stop sticky async-promotion on split-recv bodies

fix(iouring): never auto-select SQPOLL (#377)

#356 adaptive promotion was terminal: once a route accumulated adaptivePromoteStreak slow inline runs it was pinned to async dispatch forever (adaptiveLearning returns false for promoted routes, and a promoted route runs async so it is never re-timed). A CPU-bound chain whose inline WALL-CLOCK briefly crossed adaptivePromoteThreshold under transient worker contention (not actual blocking) was therefore stuck on the ~32%-slower async path until process restart — the intermittent chain-fullstack collapse. Promotion now expires after adaptivePromoteTTL (5s): isPromoted drops the route from the promoted set and resets its slow streak once the stamp is older than the TTL, so the next request runs inline again and is re-timed. A genuinely-blocking route re-promotes within adaptivePromoteStreak runs; a transient false-positive stays inline and re-settles. The clock (nowNano, a test-stubbable package var) is read only for routes already in the promoted set, so the inline/learning/ settled fast paths are unchanged. Tests: promotion expires + slow-streak reset; de-promoted route settles when fast; still-blocking route re-promotes after expiry. NOTE: this reverts promotion at the ROUTING layer. A connection already promoted to its async dispatch goroutine (cs.asyncPromoted) stays there until it closes — the worker owns recv but the async->inline conn handoff is separate, larger work. So long-lived keep-alive conns recover on reconnect / new conns; full in-place conn recovery is a follow-up.

…otion (#364) Completes the celeris#364 fix. This PR's first commit made ROUTE promotion reversible (TTL); this makes the per-CONNECTION promotion reversible too, so a long-lived keep-alive conn that was promoted to its async dispatch goroutine returns to the inline fast path when the promoting route de-promotes. Without it, such a conn stayed on the ~32%-slower blocking-write+handoff path until it closed (the bench scenario only recovered on reconnect).

Mechanism:

- The worker records the route that forced promotion (h1State.CurrentRoute, single-shot recv only) before starting the dispatch goroutine.
- The dispatch goroutine, at its idle park point (asyncInBuf drained, last response written, no partial request), checks canRevertToInline: route's RouteAsync now false. If so it clears asyncPromoted and exits; the worker already owns recv and resumes the inline fast path on the next CQE.
- cs.asyncPromoted becomes atomic.Bool: the goroutine clears it while the worker reads it on the recv hot path. The worker's feed path re-checks it under asyncInMu (the same lock the goroutine clears it under) before appending to asyncInBuf, closing the feed-vs-revert race.

The revert happens only at a clean request boundary (HasPendingData false), so h1State ownership flips back to the worker exactly as for a fresh inline conn; #256 bodyRecvPin retained.

Tests (engine, linux): TestAsyncConnRevertsOnRouteDepromotion proves revert via re-promotion (a still-async conn cannot re-promote); TestAsyncConnRevertRace hammers promote/feed/revert/re-promote across 64 keep-alive conns. Both pass under -race; full async-churn UAF suite stays green under -race.

…-promotion fix: fully reversible adaptive promotion — route TTL + connection revert (#364)

The load-based worker scaler (pause/resume by connection count) is obsolete on kernel 7.0+ — its concentration premise has reversed (more workers win at every concurrency, 4-core and 16-core alike) and its down-scale strands keep-alive throughput on surge-after-quiet (-31% on get-simple-1024c in the harness sequence). Removing it recovers that throughput with no regression. Worker pool is now static = numCPU (Resources.Workers), all workers always active; the adaptive engine's accept-pause/suspend lifecycle is unaffected.

Pick the START engine from probed io_uring capabilities (chooseStartEngine): io_uring on bundles-era (6.10+) or the 6.1 fast tier (DEFER_TASKRUN+SINGLE_ISSUER+multishot+provided-buffers), epoll otherwise. On kernel 7.0 adaptive now starts on io_uring, capturing the high-conc keep-alive throughput the old epoll-start default stranded (~+12%). Standby construction is LAZY: only the start engine is built+Listen-ed eagerly; the other is constructed on the first switch that needs it. When the start engine is already best and never switches, the standby is never built — cutting the dual-engine tax from ~7% to ~0.9% (same-binary interleaved at 1024c on 7.0). Conns-per-worker UP/DOWN switching is gated OFF in production: pinned conns never migrate, so the start engine decides keep-alive throughput, and the down-revert otherwise fired on idle/warmup dips and stranded load. The io_uring error-rate safety revert stays always-on. The conns-per-worker controller + multi-signal telemetry (conns/worker, accept rate, bytes/req via new per-engine accept/close/byte counters) are kept, gated, for a future middle-tier kernel with a real crossover (to be validated by a multi-kernel sweep). Old CPU-bias score machinery removed. Switch-safety invariants unchanged (resume-before-pause, synchronous PauseAccept/H2-dial-RST, ASYNC_CANCEL, driver-FD refusal, freezeState).

… switch Redesign the adaptive start-engine decision around connection pinning: an established conn cannot migrate between epoll and io_uring, so the START engine decides keep-alive throughput and the workload concurrency is unknowable at Listen() time.

chooseStartEngine now gates only on t0-knowable facts: env override -> io_uring not viable (viable means kernel fast-tier AND RLIMIT_MEMLOCK can fund the workers) -> Protocol==H2C -> WorkloadHint==HighConcurrency -> default. The default flips from io_uring-on-modern-kernels to EPOLL: every server ramps from zero connections (the low-concurrency regime where epoll wins on both throughput and tail latency). io_uring starts only on an explicit WorkloadHint=HighConcurrency (new operator field) when kernel + memlock allow.

New helpers/fields:

- iouring.MaxWorkersForMemlock(): the memlock worker ceiling, exported so the start decision avoids io_uring's silent 1-worker collapse proactively, not just on construction failure. capWorkersToMemlock derives from it.
- resource.Resources.WorkloadHint + root celeris.WorkloadHint (Unspecified/LowConcurrency/HighConcurrency).

Runtime switch (controller) re-enabled but constrained: only on the epoll-start path with io_uring viable and a non-h2c protocol, it promotes NEW connections to io_uring when conns/worker sustains the crossover. Pinning keeps the switch inert for a pure keep-alive burst; it helps ramps/churn. The load-driven down-revert is disabled (pinning makes it harmful); the io_uring error-revert stays always-on. Thresholds tuned to the epoll-vs-io_uring sweep: up 20->24, high-watermark 32->48, large-payload suppression 16384->8192 bytes.

Empirical basis (msa2-server, kernel 7.0, real NIC): epoll wins <=32 conns (+~20%, ~40% lower tail), tie 64-256, io_uring wins >=~384 conns (~24/worker, +8-13%); io_uring's edge is h1-small-payload only (h2c/large payloads tie) and collapses under low RLIMIT_MEMLOCK (1 worker ~= 1/5 throughput).

Validated on the cluster: resource/adaptive/iouring/epoll/root suites pass on real io_uring; default adaptive starts epoll, env/hint force io_uring, and a 1024c load fires the epoll->io_uring switch.

Cross-engine connection migration (transplant pinned conns) deferred to a v1.6.0 spike (#383): only H1-idle epoll->io_uring is feasible; H1-mid-request and H2 are impossible under the current parser/HPACK/stream architecture.
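The decision order above can be read as a short chain of early returns. This is a sketch under stated assumptions: the startInputs struct and its field names are hypothetical, the env override is modeled as an optional engine value, and the choice of epoll for H2C is inferred from the note that io_uring's edge is h1-small-payload only, not stated outright in the commit.

```go
package main

import "fmt"

type Engine int

const (
	Epoll Engine = iota
	IOUring
)

type WorkloadHint int

const (
	Unspecified WorkloadHint = iota
	LowConcurrency
	HighConcurrency
)

// startInputs is a hypothetical bundle of the facts known at Listen() time.
type startInputs struct {
	EnvOverride         *Engine // nil when no override is set
	KernelFastTier      bool    // probed io_uring capabilities are sufficient
	MemlockFundsWorkers bool    // RLIMIT_MEMLOCK can fund the worker rings
	H2C                 bool    // Protocol == H2C
	Hint                WorkloadHint
}

// chooseStartEngine follows the documented order:
// env override -> io_uring not viable -> H2C -> HighConcurrency hint -> default.
func chooseStartEngine(in startInputs) Engine {
	if in.EnvOverride != nil {
		return *in.EnvOverride
	}
	if !(in.KernelFastTier && in.MemlockFundsWorkers) {
		return Epoll // io_uring not viable
	}
	if in.H2C {
		return Epoll // assumption: h2c gains nothing from io_uring
	}
	if in.Hint == HighConcurrency {
		return IOUring
	}
	return Epoll // default: every server ramps from zero connections
}

func main() {
	viable := startInputs{KernelFastTier: true, MemlockFundsWorkers: true}
	fmt.Println(chooseStartEngine(viable) == Epoll) // true: default

	viable.Hint = HighConcurrency
	fmt.Println(chooseStartEngine(viable) == IOUring) // true: explicit hint

	viable.MemlockFundsWorkers = false
	fmt.Println(chooseStartEngine(viable) == Epoll) // true: not viable
}
```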

…+ UsesDriver Close the async/sync handler review's footguns:

#1 Server.AsyncHandlers() reports the EFFECTIVE async state (config.AsyncHandlers || router.hasAsyncRoutes()) instead of the raw config flag. WithEngine drivers select their netpoll-park fast path from this, so the recommended "AsyncHandlers=false + mark DB routes .Async()" idiom no longer silently drops the driver onto its slow busy-spin mini-loop. Only the driver registry consults this method, so the change is targeted. The value is read at driver construction, so open WithEngine drivers AFTER registering .Async() routes (or set AsyncHandlers=true); documented on the field.

#3 adaptiveBlockingThreshold (2ms): a single unambiguously-blocking inline run promotes the adaptive route immediately (router.promoteRouteImmediate), skipping the adaptivePromoteStreak (8) hysteresis, so a forgotten-.Async() blocking handler stalls a worker for at most one request. The 300us/8-streak path still owns the 300us-2ms band; a CPU chain cannot cross 2ms under jitter.

#4 Route.UsesDriver() - intent-revealing alias for .Async() on driver routes.

#2 Config.AsyncHandlers doc rewritten (effective-state behavior, construction-order caveat, recommend AsyncHandlers=true OR .Async()/.UsesDriver() for driver routes; the adaptive net only auto-promotes handlers slower than 300us).

Driver benchmark (msa2-server, 128c, footgun config = AsyncHandlers=false + per-route .Async(), before vs after #1): the fix recovers the async win and ~halves p99, matching the global-AsyncHandlers=true fast path -

redis      iouring 87k->107k (+23%)  p99 2.8->1.4ms ; epoll 86k->107k
memcached  iouring 92k->136k (+48%)  p99 3.0->1.2ms
postgres   iouring 62k-> 84k (+35%)  p99 3.1->1.9ms

Tests: 8 white-box unit (async_improvements_test.go) + 1 epoll integration (async_promote_integration_linux_test.go: single >2ms run promotes immediately); full celeris + driver suites green on real io_uring.

celeris.Version was stale at 1.5.0 (unchanged since the 1.5.0 tag, so even v1.5.2 shipped "1.5.0"). Bump to 1.5.3. Bump the four publishable middleware submodules' `require github.com/goceleris/celeris` from v1.4.4 to v1.5.3 so the tagged submodules resolve the matching core (release.yml's tag-submodules job warns when they drift); the local `replace => ../../` keeps in-tree builds against the live core.

The Supported Versions table + policy still named the 1.4.x line as supported even though 1.5.0-1.5.2 had shipped. Mark >= 1.5.0 supported, 1.4.x and earlier unsupported, and add a v1.5.x Security Improvements section (io_uring gen-tagged-CQE UAF hardening, H1 request-smuggling hardening, cpuMon + epoll-detach data-race fixes, middleware/secure COEP + X-Download-Options default-off behavior change, Go 1.26.4 toolchain bump).

TestCheckCapSysNiceIsSideEffectFree lived in the untagged probe_test.go but called getNice(), which is defined only in probe_caps_linux_test.go (//go:build linux). The package failed to compile on darwin/non-linux, so `go test ./...` / `go vet ./...` broke for local devs (CI was unaffected: linux jobs compile it, the macos job only runs go build). Move the test next to getNice under the linux tag.

- middleware/metrics: prometheus/common v0.68.1 -> v0.69.0 (the published submodule's only outdated direct dep; supersedes dependabot #381).
- test/drivercmp/memcached + test/perfmatrix: bradfitz/gomemcache -> latest; perfmatrix also goceleris/loadgen v1.4.8 -> v1.4.9 (test-only modules).
- CI: actions/checkout v6 -> v7 across ci/release/drivers (supersedes dependabot #382).

Root module + the other three published middleware modules already pin current direct deps. metrics builds+tests green on 0.69.0; actionlint clean.

…sting) The adaptive package is linux-tagged, so golangci-lint only sees it on the linux CI runner, and milestone/v1.5.3 never had a prior CI run (ci.yml gates on PR/push to main), so four findings slipped in:

- controller.go:92 + engine.go:137: gofmt formatting.
- engine.go:256: ineffectual local 'startType = engine.Epoll' (the struct field e.startType is the one actually read at activeIsPrimary); drop it.
- start_test.go:26: withMemlock param 'max' shadowed the builtin (revive redefines-builtin-id); rename to maxWorkers.

Verified: GOOS=linux golangci-lint ./... = 0 issues, cross-compile build+vet clean.

FumingPower3925 added 30 commits June 17, 2026 10:12

Merge pull request #357 from goceleris/feat/v1.5.3/iouring-sendzc-opc…

dad8173

…ode-fix fix(iouring): correct SEND_ZC opcode + CQE_F_NOTIF flag — zero-copy send was dead on every kernel

Merge pull request #358 from goceleris/feat/v1.5.3/iouring-adaptive-i…

53027a6

…nline feat(core): adaptive inline-first dispatch under AsyncHandlers (#356) — iouring-async +7.5..48% on CPU-bound

Merge pull request #362 from goceleris/feat/v1.5.3/h1-no-double-heade…

62803d4

…r-scan perf(h1): drop redundant upfront findHeaderEnd block scan (#359)

Merge pull request #363 from goceleris/feat/v1.5.3/adaptive-settle-fast

c7bc678

perf(core): settle non-blocking adaptive routes to stop per-req timing (#361)

Merge pull request #365 from goceleris/fix/v1.5.3/adaptive-promote-th…

198960e

…reshold fix(core): raise adaptive promote threshold 50µs->300µs (#364 hardening)

Merge pull request #366 from goceleris/feat/v1.5.3/respheader-scratch…

4947d11

…-reuse perf(core): reuse heap respHeaders backing across requests (#360)

Merge pull request #368 from goceleris/perf/v1.5.3/h1-reset-deadstore

f0a92ac

perf(h1): drop dead-store writes in ResetH1Stream (#346)

Merge pull request #369 from goceleris/fix/v1.5.3/adaptive-bias-defau…

822859c

…lt-off fix(adaptive): gate io_uring bias off by default (#341)

Merge pull request #370 from goceleris/perf/v1.5.3/epoll-nodelay-inherit

027054b

perf(epoll): inherit TCP_NODELAY from the listen socket (#337)

Merge pull request #371 from goceleris/perf/v1.5.3/iouring-sendzc-siz…

3137b78

…e-gate perf(iouring): gate SEND_ZC by payload size (#332)

Merge pull request #372 from goceleris/feat/v1.5.3/adaptive-reversibl…

3d06697

…e-bias feat(adaptive): reversible io_uring bias, default on (#338)

Merge pull request #373 from goceleris/perf/v1.5.3/blob-manyheader-sc…

1deeda2

…ratch perf(core): reuse blobHdrScratch for >16-header responses

Merge pull request #375 from goceleris/fix/v1.5.3/secure-coep-optin

ad7a466

fix(secure): default COEP + X-Download-Options off (opt-in) (#338)

Merge pull request #378 from goceleris/fix/v1.5.3/post-body-sticky-pr…

7c5e637

…omotion fix(iouring): close post-4k gap — stop sticky async-promotion on split-recv bodies

FumingPower3925 added 13 commits June 17, 2026 19:53

Merge pull request #379 from goceleris/fix/v1.5.3/disable-sqpoll-377

63f4110

fix(iouring): never auto-select SQPOLL (#377)

Merge pull request #380 from goceleris/fix/v1.5.3/reversible-adaptive…

2fcc9a4

…-promotion fix: fully reversible adaptive promotion — route TTL + connection revert (#364)

FumingPower3925 merged commit e4da508 into main Jun 21, 2026
31 checks passed

FumingPower3925 deleted the milestone/v1.5.3 branch June 21, 2026 15:43

FumingPower3925 merged 43 commits into main from milestone/v1.5.3
